Abstract

The design of complexity-aware cascaded detectors, combining features of
very different complexities, is considered. A new cascade design procedure
is introduced, by formulating cascade learning as the Lagrangian
optimization of a risk that accounts for both accuracy and complexity. A
boosting algorithm, denoted as complexity aware cascade training (CompACT),
is then derived to solve this optimization. CompACT cascades are shown to
seek an optimal trade-off between accuracy and complexity by pushing
features of higher complexity to the later cascade stages, where only a few
difficult candidate patches remain to be classified. This enables the use
of features of vastly different complexities in a single detector. As a
result, the feature pool can be expanded to features previously impractical
for cascade design, such as the responses of a deep convolutional neural
network (CNN). This is demonstrated through the design of a pedestrian
detector with a pool of features whose complexities span orders of
magnitude. The resulting cascade generalizes the combination of a CNN with
an object proposal mechanism: rather than a pre-processing stage, CompACT
cascades seamlessly integrate CNNs in their stages. This enables state of
the art performance on the Caltech and KITTI datasets, at fairly fast
speeds.

1. Introduction 

Pedestrian detection is an important problem in computer vision. Many of
its applications, e.g. smart vehicles or surveillance, require real-time
detection. Since, under the popular sliding window paradigm, there are
close to a million windows per 640x480 pixel image, detection complexity
can easily become intractable. This is an impediment to the deployment of
sophisticated classifiers, such as deep learning models, in the pedestrian
detection arena. The most popular architecture for real-time object
detection is the detector cascade of [32]. It exploits the fact that most
image patches can be assigned to the background class by evaluation of a
few simple cascade stages. This guarantees computational efficiency without
compromising accuracy, since the few resulting false positives can be
rejected by more complex detectors, in the late cascade stages. Given that
these are rarely used, their complexity is not an impediment to high
detection speeds. As a result, it is possible to have both efficient and
accurate detection.

While the cascade detection principle is intuitive, its implementation is
far from trivial. Early cascade designs required extensive heuristics to
determine the cascade configuration [32, 35, 3], lacking the ability to
explicitly optimize the trade-off between detection accuracy and
complexity. A commonly used assumption is that all features have
equivalent complexity. This significantly simplifies the design, which
reduces to choosing the features that maximize detection accuracy. In
fact, popular methods [3, 4] simply use a boosting algorithm (typically
AdaBoost [8]) to design a non-cascaded classifier and then transform it
into a cascade, by addition of thresholds. These approaches suffer from two
main problems. First, they do not aim to select features that optimize the
trade-off between detection accuracy and complexity. Second, the
"equivalent feature complexity" hypothesis only produces sensible cascades
when applied to features that indeed have similar complexity. This
constraint is, however, frequently violated [1, 23, 37].

In fact, it has been remarkably difficult to accommodate, in cascade
learning, features significantly heavier than those in common use. This
problem is particularly pressing given the recent success of deep learning
in object recognition [17, 29]. The intractable computation of a deep
learning model under the sliding window paradigm is usually addressed with
recourse to object proposal mechanisms [31], giving rise to a two-stage
cascade that is far from optimal, in terms of the trade-off between
detection accuracy and speed. For pedestrian detection, object proposals
are frequently implemented with weak pedestrian detectors, sometimes
cascaded detectors themselves [15]. Due to the ad hoc nature of these
solutions, deep learning models have not been competitive for pedestrian
detection, in contrast to their recognition and classification performance
[17, 29].

In this work, we address these problems by seeking an algorithm for optimal
cascade learning under a criterion that penalizes both detection errors and
complexity. For the latter, we introduce a measure of implementation
complexity that allows the definition of a complexity risk akin to the
empirical risk commonly used for classifier design. This makes it possible
to define quantities such as complexity margins and complexity losses, and
account for these in the learning process. We do this with recourse to a
Lagrangian formulation, which optimizes for the usual classification risk
under a constraint on the complexity risk. A boosting algorithm that
minimizes this Lagrangian is then derived. This algorithm, denoted
Complexity-Aware Cascade Training (CompACT), is shown to select inexpensive
features in the early cascade stages, pushing the more expensive ones to
the later stages. This enables the combination of features of vastly
different complexities in a single detector. These properties are
demonstrated by the successful application of CompACT to the problem of
pedestrian detection, using a pool of features ranging from Haar wavelets
to deep convolutional neural networks (CNNs).

Overall, this work makes three major contributions. First, it proposes a
novel algorithm for learning a complexity aware cascade, so as to achieve
an optimal trade-off between accuracy and speed. To the best of our
knowledge, this is the first algorithm to explicitly account for variable
feature complexity in cascade learning, supporting weak learners of widely
different complexities. Second, CompACT seamlessly integrates handcrafted
and CNN features in a unified detector. This generalizes the object
proposal architecture, guaranteeing the seamless integration of CNN stages
with stages of any other complexity. Finally, a CompACT cascade for
pedestrian detection is shown to achieve state of the art results on both
Caltech [6] and KITTI [11], at faster speeds than the closest competitors.

2. Related Works 

Detector cascades learned with boosting are commonly used for detecting
template-like objects, e.g. faces [32, 3, 35, 34], pedestrians [4, 25], or
cars [26]. Early approaches used heuristics to find a cascade configuration
of good trade-off between classification accuracy and complexity
[32, 3, 35, 34]. More recently, optimization of the accuracy-complexity
trade-off has started to receive attention [19, 25, 26, 38]. [38]
empirically added a complexity term to the objective function of
RealBoost. [19, 25, 26] introduced the Lagrangian formulation that we
adopt, but use a single feature family throughout the cascade. Since early
cascade stages must be very efficient, this implies adopting simple weak
learners, e.g. decision stumps.

This has motivated extensive work on the design of efficient features. For
pedestrian detection, the integral channel features of [5] have recently
become popular. They extend the Haar-like features of [32] into a set of
color and histogram-of-gradients (HOG) channels. More recently, a
computationally efficient version of [32], denoted the aggregate channel
features (ACF), has been introduced in [4]. [23] complemented ACF with
local binary patterns (LBP) and covariance features, for better detection
accuracy.

Several works proposed alternative feature channels, obtained by convolving
different filters with the original HOG+LUV channels [36, 37, 1, 21]. The
SquaresChnFtrs of [1] reduce the large number of features of [5, 32] to 16
box-like filters of various sizes. [21] extended the locally decorrelated
features of [13] to ACF, learning four 5x5 PCA-like filters from each of
the ACF channels. Instead of empirical filter design, Zhang et al. [36]
exploited prior knowledge about pedestrian shape to design informed
filters. They later found, however, that such filters are actually not
needed [37]. Instead, the number of filters appears to be the most
important variable: features as simple as checkerboard-like patterns, or
purely random filters, can achieve very good performance, as long as there
are enough of them. Although these detectors have reached state-of-the-art
performance [23, 37], they are relatively slow, due to the convolution
computations with several hundred filters.

While deep convolutional learning classifiers have achieved impressive
results for general object detection [12, 14], e.g. on VOC2007 or ImageNet,
they have not excelled on pedestrian detection [27, 22]. Benchmarks like
Caltech [6] are still dominated by classical handcrafted features (see e.g.
a recent comprehensive evaluation of pedestrian detectors by [2]).
Recently, [15] transferred the R-CNN framework to the pedestrian detection
task, showing some improvement over previous deep learning detectors
[27, 22]. However, the gap to the state of the art is still significant.
Deep models also tend to be too heavy for sliding window detection. This is
usually addressed with object proposal mechanisms [12, 33, 15] that
pre-select the most promising image patches. This two-stage decomposition
(proposal generation and classification) is a simple cascade mechanism. In
this work, we consider the seamless combination of these two stages into a
cascade explicitly designed to account for both accuracy and complexity, so
as to achieve detectors that are both highly accurate and fast.

3. Complexity-Aware Cascade Training 

In this section we introduce the CompACT algorithm. 

3.1. AdaBoost 

A decision rule $h(x) = \mathrm{sign}[F(x)]$ of predictor $F(x)$ maps a
feature vector $x \in \mathcal{X}$ to a class label
$y \in \mathcal{Y} = \{-1, 1\}$. Boosting learns a strong decision rule by
combining a set of weaker learners $f_k(x)$,

$$ F(x) = \sum_k f_k(x), \tag{1} $$

using functional gradient descent on a classification risk [9, 20].
AdaBoost [8] is based on the exponential loss

$$ \mathcal{R}_E[F] \approx \frac{1}{|S_t|} \sum_i e^{-y_i F(x_i)} \tag{2} $$

on training samples $S_t = \{(x_i, y_i)\}$. Boosting iterations compute the
functional derivative of (2) along the direction of weak learner $g(x)$ at
the location of the current predictor $F(x)$,

$$ \langle \delta\mathcal{R}_E[F], g \rangle = \frac{d}{d\epsilon} \mathcal{R}_E[F + \epsilon g] \Big|_{\epsilon=0} = -\frac{1}{|S_t|} \sum_i y_i w_i g(x_i), \tag{3} $$

where

$$ w_i = w(y_i, x_i) = e^{-y_i F(x_i)}. \tag{4} $$

The predictor is updated by selecting the steepest descent direction within
a weak learner pool $G = \{g_1(x), \ldots, g_n(x)\}$,

$$ g^*(x) = \arg\max_{g \in G} \langle -\delta\mathcal{R}_E[F], g \rangle = \arg\max_{g \in G} \frac{1}{|S_t|} \sum_i y_i w_i g(x_i). \tag{5} $$

The optimal step size for the update is

$$ \alpha^* = \arg\min_{\alpha} \mathcal{R}_E[F + \alpha g^*]. \tag{6} $$

For binary $g^*(x)$, this has a closed form solution

$$ \alpha^* = \frac{1}{2} \log \frac{\sum_{i | y_i = g^*(x_i)} w_i}{\sum_{i | y_i \neq g^*(x_i)} w_i}. \tag{7} $$

Otherwise, the optimal step size is found by a line search.
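The boosting iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `adaboost_step`, the array layout, and the small constant `eps` are assumptions made for the example, and the weak learners are assumed to be binary with outputs in {-1, +1} so that the closed-form step size applies.

```python
import numpy as np

def adaboost_step(F, y, G):
    """One boosting iteration on the exponential loss (illustrative sketch).

    F : (n,) current predictor values F(x_i)
    y : (n,) labels in {-1, +1}
    G : (p, n) responses g_j(x_i) of p binary weak learners in {-1, +1}
    Returns the index of the selected weak learner and its step size.
    """
    w = np.exp(-y * F)                      # example weights w_i = exp(-y_i F(x_i))
    scores = (G * (y * w)).mean(axis=1)     # (1/|S_t|) sum_i y_i w_i g(x_i), per learner
    j = int(np.argmax(scores))              # steepest descent direction within the pool
    agree = G[j] == y
    w_correct = w[agree].sum()              # total weight of correctly classified examples
    w_wrong = w[~agree].sum()               # total weight of misclassified examples
    eps = 1e-12                             # assumption: guards against division by zero
    alpha = 0.5 * np.log((w_correct + eps) / (w_wrong + eps))
    return j, alpha
```

For non-binary weak learners the closed form no longer holds and the step size would be obtained by a one-dimensional search instead.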


3.2. Complexity-Aware Learning 

Complexity-aware learning aims for the best trade-off between
classification accuracy and complexity. This can be formulated as a
constrained optimization problem, where classification risk is minimized
under a bound on a complexity risk $\mathcal{R}_C[F]$,

$$ F^*(x) = \arg\min_F \mathcal{R}_E[F] \quad \text{s.t.} \quad \mathcal{R}_C[F] < \gamma, \tag{8} $$

and is identical to the minimization of the Lagrangian

$$ \mathcal{L}[F] = \mathcal{R}_E[F] + \eta \mathcal{R}_C[F], \tag{9} $$

where $\eta$ is a Lagrange multiplier that only depends on $\gamma$. To
define a complexity risk, we note that (2) can be written as

$$ \mathcal{R}_E[F] \approx \frac{1}{|S_t|} \sum_i \phi(\xi(y_i, F(x_i))), \tag{10} $$

with $\phi(v) = e^{-v}$ and $\xi(y, F(x)) = yF(x)$. The function
$\xi(\cdot)$ is the margin of example $x$ under predictor $F(\cdot)$ and
measures the confidence of the classification. Large positive margins
indicate that $x$ is correctly classified with high confidence, large
negative margins the same for incorrect classification, and a margin of
zero that the example is on the classification boundary. The loss
$\phi(\cdot)$ is usually monotonically decreasing, penalizing all examples
with less than a small positive margin. This forces the learning algorithm
to concentrate on these examples, so as to produce as few negative margins
as possible. The exponential loss of AdaBoost makes the penalty exponential
on the confidence of incorrectly classified examples.

In this work, we consider complexity risks of a similar form

$$ \mathcal{R}_C[F] = \frac{1}{|S_t|} \sum_i \tau[\kappa(y_i, F(x_i))], \tag{11} $$

where $\kappa[y, F(x)]$ is a measure of complexity for the classification
of example $x$ under $F(\cdot)$ and $\tau(\cdot)$ a non-negative loss
function that penalizes complexity. Drawing inspiration from the
classification risk, we measure complexity with the complexity margin

$$ \kappa[y, F(x)] = y\,\Omega(F(x)), \tag{12} $$

where $\Omega(F(x))$ is a function of the time required to evaluate $F(x)$,
e.g. a number of machine operations or some other empirical measure of
complexity. The complexity margin of (12) assigns positive (negative)
complexity to positive (negative) examples, reflecting the fact that the
computation spent on negative examples is "wasted" or "negative" while that
spent on positives is "justified" or "positive". While positives have to
survive all cascade stages, negatives should be rejected with little
computation. The complexity loss $\tau(v)$ then determines the
complexity-aware behavior of learning algorithms. For example, a decreasing
$\tau(v)$ for $v < 0$ penalizes negative examples of large complexity. This
encourages classifiers that reject negatives with as little computation as
possible. On the other hand, an increasing $\tau(v)$ for $v > 0$ penalizes
positives of large complexity.

3.3. Embedded Cascade

A cascaded classifier is implemented as a sequence of classification stages
$h_i(x) = \mathrm{sgn}[F_i(x) + T_i]$, where $T_i$ is a threshold. A
popular architecture is the embedded cascade, whose predictor has the
embedded structure

$$ F_k(x) = F_{k-1}(x) + f_k(x) = \sum_{j=1}^{k} f_j(x). \tag{13} $$


In this paper, the cascade complexity is measured by the average per stage
complexity,

$$ \Omega(F(x)) = \frac{1}{m} \sum_{k=1}^{m} r_k(x)\, \Omega(f_k(x)), \tag{14} $$

where, using $u[\cdot]$ to denote the Heaviside step function,

$$ r_k(x) = \prod_{i=1}^{k-1} u[F_i(x) + T_i], \tag{15} $$

is an indicator of examples that survive all stages prior to $k$, i.e.
$r_k(x) = 1$ if $F_i(x) + T_i > 0, \forall i < k$, and $r_k(x) = 0$
otherwise. Since the average complexity is bounded by the largest weak
learner complexity, it leads to a more balanced Lagrangian in (9) than the
total complexity.
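To make the survival indicator, the average per-stage complexity, and the complexity risk concrete, the following NumPy sketch evaluates them on a small batch of examples. It is an illustration under stated assumptions: the function names and array layout are hypothetical, per-stage costs are taken to be example independent, and the hinge loss max(0, -v) is used as the complexity loss on the complexity margin.

```python
import numpy as np

def survival_indicators(stage_scores, thresholds):
    """r_k(x) for k = 1..m: 1 if x passed every stage before k, else 0.

    stage_scores : (n, m) cumulative predictor values F_k(x_i)
    thresholds   : (m,) stage thresholds T_k
    """
    passed = (stage_scores + thresholds) > 0            # u[F_k(x) + T_k]
    n = passed.shape[0]
    # Stage 1 is always evaluated; stage k requires stages 1..k-1 to be passed.
    prior = np.cumprod(passed[:, :-1].astype(float), axis=1)
    return np.hstack([np.ones((n, 1)), prior])

def average_complexity(stage_scores, thresholds, stage_costs):
    """Average per-stage complexity: (1/m) * sum_k r_k(x) * cost_k."""
    r = survival_indicators(stage_scores, thresholds)
    return (r * stage_costs).mean(axis=1)

def complexity_risk(y, omega):
    """Empirical complexity risk with hinge loss max(0, -v) applied to the
    complexity margin v = y * Omega(F(x))."""
    margin = y * omega
    return np.maximum(0.0, -margin).mean()
```

With this choice of loss only negatives contribute to the risk, and a negative rejected at the first stage contributes far less than one that reaches an expensive late stage.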


3.4. Cascade Boosting

The minimization of (9) requires the functional derivative of the
Lagrangian along the direction of weak learner $g(x)$ at the location of
the current predictor $F(x)$,

$$ \langle \delta\mathcal{L}[F], g \rangle = \langle \delta\mathcal{R}_E[F], g \rangle + \eta \langle \delta\mathcal{R}_C[F], g \rangle, \tag{16} $$

where $\langle \delta\mathcal{R}_E[F], g \rangle$ is as in (3). To compute
the derivative of the complexity risk we define $u(\epsilon)$ as
$u(\epsilon) = 1$ for $\epsilon > 0$ and $u(\epsilon) = 0$ otherwise, and
write

$$ \Omega(F(x) + \epsilon g(x)) = \zeta_m \Omega(F(x)) + \frac{u(\epsilon)}{m+1}\, r_{m+1}(x)\, \Omega(g(x)), \tag{17} $$

where $\zeta_m = 1 - \frac{1}{m+1}$ and we have used (14). Since
$u(\epsilon)$ is not differentiable, it is approximated by
$u(\epsilon) \approx \sigma(\epsilon)$, where $\sigma(\epsilon)$ is a
differentiable function with $\sigma(0) = 0$, leading to

$$ \langle \delta\mathcal{R}_C[F], g \rangle = -\frac{1}{|S_t|} \sum_i y_i\, \psi(y_i, x_i)\, \frac{r_{m+1}(x_i)}{m+1}\, \Omega(g(x_i)), $$

where

$$ \psi(y_i, x_i) = -\tau'[y_i \Omega(F(x_i))]\, \sigma'(0). \tag{18} $$

Each boosting iteration updates $F(x)$ with a step along the steepest
descent direction of (16) within the weak learner pool $G$,

$$ g^*(x) = \arg\max_{g \in G} \langle -\delta\mathcal{L}[F], g \rangle. \tag{19} $$

Combining (3), (16), and (17) and denoting $r_i = r_{m+1}(x_i)$,
$w_i = w(y_i, x_i)$, $g_i = g(x_i)$, and $\psi_i = \psi(y_i, x_i)$, this is
the direction that maximizes

$$ \mathcal{L}[g] = \frac{1}{|S_t|} \sum_i y_i w_i g_i + \frac{\eta}{m+1} \frac{1}{|S_t|} \sum_i y_i \psi_i r_i\, \Omega(g(x_i)). \tag{20} $$

Note that the term $\zeta_m \Omega(F(x_i))$ of (17) does not depend on $g$
and plays no role in the optimization. The optimal step size for the update
is

$$ \alpha^* = \arg\min_{\alpha} \mathcal{L}[F + \alpha g^*], \tag{21} $$

and can be found by a line search. The cascade predictor is finally updated
with

$$ F^{new}(x) = F(x) + \alpha^* g^*(x). \tag{22} $$

Note that, from (18), $\sigma'(0)$ is a constant that rescales all $\psi_i$
equally. Hence, in (20), it can be absorbed into $\eta$. Without loss of
generality, we assume that $\sigma'(0) = 1$. This boosting algorithm is
denoted the complexity aware cascade training (CompACT) boosting algorithm.
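The weak learner selection step can be sketched as follows. This is an illustrative reading of the objective, not the authors' code: the function name `compact_select` and the array layout are assumptions, each weak learner is assumed to have an example-independent cost, and the complexity weights are set as for a hinge complexity loss (1 for negatives, 0 for positives, with the derivative of the smooth step at zero taken as 1).

```python
import numpy as np

def compact_select(F, y, active, G, costs, eta, m):
    """Pick the weak learner that best trades accuracy against complexity.

    F      : (n,) current predictor values F(x_i)
    y      : (n,) labels in {-1, +1}
    active : (n,) 1 if the example survives the current m stages, else 0
    G      : (p, n) weak learner responses g_j(x_i)
    costs  : (p,) complexity of each weak learner (assumed example independent)
    eta    : Lagrange multiplier weighting the complexity term
    m      : number of stages already in the cascade
    """
    w = np.exp(-y * F)                       # boosting weights
    psi = (y < 0).astype(float)              # hinge loss: only negatives are penalized
    accuracy = (G * (y * w)).mean(axis=1)    # (1/|S_t|) sum_i y_i w_i g_i
    # mean_i y_i psi_i r_i is <= 0, so a larger cost lowers the objective
    penalty = (y * psi * active).mean() * costs
    objective = accuracy + eta / (m + 1) * penalty
    return int(np.argmax(objective)), objective
```

Setting `eta` to zero recovers the plain steepest-descent selection, and an empty set of active negatives removes the complexity pressure entirely.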

3.5. Properties 

CompACT has a number of interesting properties. First, the contribution of
each training example to the complexity term in (20) is multiplied by
$r_i$. Hence, only examples that survive the current cascade $F$ contribute
to the complexity term. We refer to the $x_i$ such that $r_i = 1$ as active
examples. Note that, given the set of active examples

$$ S_a(F) = \{(x_i, y_i) \in S_t \mid r_i = 1\}, \tag{23} $$

associated with $F$, (20) can be replaced by

$$ \mathcal{L}[g] = \frac{1}{|S_t|} \sum_i y_i w_i g_i + \frac{\eta}{m+1} \frac{1}{|S_t|} \sum_{i | x_i \in S_a} y_i \psi_i\, \Omega(g(x_i)). \tag{24} $$

This complies with the intuition that examples which do not reach stage
$m + 1$ during the cascade operation should not affect the complexity term
for that stage.

Second, most implementations of cascaded classifiers use weak learners of
example-independent complexity, i.e. $\Omega(g(x_i)) = \Omega_g, \forall i$.
While this does not hold for the cascade in general (different examples can
be rejected at different stages), it holds for the examples in $S_a$, i.e.
$\Omega(F(x_i)) = \Omega_F, \forall x_i \in S_a$. In this case, the
complexity weights $\psi_i$ only depend on the label $y_i$. Defining
$\psi_F^+ = -\tau'[\Omega_F]$ ($\psi_F^- = -\tau'[-\Omega_F]$) as the value
of $\psi_i$ for positive (negative) examples, and $\pi_F^-$ ($\pi_F^+$) as
the percentage of negative (positive) active examples, (20) reduces to

$$ \mathcal{L}[g] = \frac{1}{|S_t|} \sum_i y_i w_i g_i + \frac{\eta}{m+1} \frac{|S_a|}{|S_t|}\, \xi_F\, \Omega_g, \tag{25} $$

with $\xi_F = \pi_F^+ \psi_F^+ - \pi_F^- \psi_F^-$. Since $|S_a|$ decreases
with cascade length, the rescaling of $\eta$ by $|S_a| / |S_t|$ gradually
weakens the complexity constraint as the cascade grows. While in the early
iterations there is pressure to select weak learners of reduced complexity,
this pressure reduces as iterations progress. Gradually, complex weak
learners are penalized less and the algorithm asymptotically reduces to a
cascaded version of AdaBoost. This makes intuitive sense, since the later
cascade stages process a much smaller percentage of the examples than the
earlier ones and have much less impact on the overall complexity. On the
other hand, since the surviving examples are the most difficult to
classify, accurate classification requires weak learner accuracy to
increase with cascade length. This usually (but not always) implies that
weak learner complexity increases as well, because powerful features
usually require heavy computation. By pushing the complexity to the later
stages, the algorithm can learn cascades that are both accurate and
computationally efficient. This effect is reinforced by the fact that
$1/(m + 1)$ also decreases with cascade length.

The loss $\tau(v)$ enables fine-tuning of this general behavior, via
$\xi_F$. In this work, we adopt the hinge loss $\tau(v) = \max(0, -v)$, for
which $\psi_F^- = 1$, $\psi_F^+ = 0$ and $\xi_F = -\pi_F^-$. This assigns
no penalty to the complexity of positive examples, encouraging CompACT to
focus on the fast rejection of negatives.
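Under the hinge loss, the factor multiplying the weak learner cost has magnitude eta/(m+1) times the fraction of training examples that are active negatives. The short sketch below, with a hypothetical function name and made-up example counts, illustrates how quickly this factor shrinks as the cascade grows and negatives are rejected.

```python
def complexity_weight(eta, m, n_active_neg, n_train):
    """Magnitude of the coefficient on the weak learner cost under the hinge loss:
    eta/(m+1) * |S_a|/|S_t| * pi_F^-  ==  eta/(m+1) * (active negatives)/|S_t|.

    All arguments are illustrative inputs, not values from the paper.
    """
    return eta / (m + 1) * n_active_neg / n_train

# Early in training: few stages, most negatives still active.
early = complexity_weight(eta=0.1, m=0, n_active_neg=9000, n_train=10000)
# Late in training: many stages, almost all negatives already rejected.
late = complexity_weight(eta=0.1, m=99, n_active_neg=90, n_train=10000)
```

In this made-up setting the late-stage weight is four orders of magnitude smaller than the early one, which is the mechanism that lets expensive features enter only the late stages.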

4. Pedestrian Detection 

This section discusses the proposed pedestrian detector. 

4.1. Feature Pools of Variable Complexity 

CompACT seeks the optimal trade-off between accuracy and complexity, at
each cascade stage. This is most effective when the feature pool is
composed of features of various complexities. In the cascade literature,
where most detectors use a single feature family, it is common practice to
pre-compute a large number of feature responses at all image locations,
before any detection takes place [21, 37, 23]. This, however, has
infeasible complexity if the feature pool is very large (e.g. the 200,000
to 500,000 features proposed per patch in [37, 23]) or some features are
computationally intensive (e.g. the CNN features of [17, 29]). In these
cases, it is neither tractable nor necessary to pre-compute all features at
all image locations. For example, a cascade of 2048 decision trees of depth
2 will evaluate at most 4096 features per patch. Since the cascade rejects
most candidate patches after a few stages, the most intensive features
(e.g. CNN) are unlikely to be needed at most image locations. Hence, while
pre-computation is useful for low-complexity features, complex features
should be evaluated as necessary. We refer to the former as pre-computed
features and the latter as computed just-in-time (JIT).
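The just-in-time idea can be sketched as a cascade evaluator that computes a feature only when a stage first asks for it and stops as soon as the patch is rejected. This is a simplified illustration: the function name, the `(feature_name, weak_learner, threshold)` stage layout, and the dictionary of extractors are assumptions made for the example, not the detector's actual data structures.

```python
def evaluate_cascade(patch, stages, extractors):
    """Run an embedded cascade on one patch, computing features only on demand.

    stages     : list of (feature_name, weak_learner, threshold) tuples
    extractors : dict mapping feature_name -> function(patch) -> feature value
    Returns (accepted, score, sorted names of the features actually computed).
    """
    cache = {}
    score = 0.0
    for name, weak_learner, threshold in stages:
        if name not in cache:                # just-in-time: compute on first use only
            cache[name] = extractors[name](patch)
        score += weak_learner(cache[name])   # embedded cascade: scores accumulate
        if score + threshold <= 0:           # rejected: costlier later stages never run
            return False, score, sorted(cache)
    return True, score, sorted(cache)
```

A patch rejected by an early, cheap stage never triggers the expensive extractors, which is why their cost has little effect on the average detection time.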
Figure 1. Eight 2x2 checkerboard-like filters used in this work.
Red (Green) is used to represent value +1 (-1).


4.1.1 Pre-computed Features 

Our pre-computed feature set consists of ACF [4], mostly due to its
computational efficiency. Following [4], we extract 10 LUV+HOG channels.
Since these are pre-computed, the complexity of using an ACF feature in any
cascade stage is 1.


4.1.2 Just-in-time Features 


The JIT pool contains several feature subsets. The ability 
to weigh accuracy vs. computation enables CompACT to 
seamlessly combine these feature sets. 

SS: The self-similarity (SS) features of [28] capture the difference
between local patches and have achieved good performance on edge detection
tasks [18, 7]. Following [18, 7], we compute SS features on a 12x6 grid of
the 16x8 ACF patch. This results in $\binom{72}{2} \times 10 = 25,560$ SS
features per patch. Since the computation of an SS feature involves 2 ACF
values, its complexity is 2.

CB: Checkerboard features (CB) are the result of convolving the ACF
channels with a set of checkerboard filters. [37] has shown that a simple
set of such features could achieve state-of-the-art performance for
pedestrian detection. Based on their observation that the number of
features determines performance (rather than feature type), we adopt the
set of 8 simple 2x2 checkerboard filters of Figure 1. A CB feature has
implementation complexity of 4.


LDA: Locally decorrelated HOG features, computed with linear discriminant
analysis (LDA), have been shown to outperform HOG features for object
detection [13]. [21] showed that the computation of these features on ACF
channels leads to a big improvement over ACF. We adopt this feature family
but, unlike [21], restrict the filter size to 3x3. LDA features have
complexity 9.


CNN: In addition to operators defined over the ACF channels, we consider a
set of CNN features. The CNN is a smaller version of the popular model of
[17], with five convolutional layers and one fully connected layer. The CNN
is applied to 64x64 image patches, the first convolutional layer has 32
filters, the remaining four have 64, and the fully connected layer consists
of 1024 hidden units. All convolutional filters have size 3x3, and stride 1.
The CNN model was originally trained with the ILSVRC14-DET dataset [24],
using the cropped object patches, and then fine-tuned on the target
pedestrian dataset. For feature extraction, we only use the output of the
5th convolutional layer, which can be seen as CNN feature channels, similar
to ACF. These features are denoted as CNN. Inspired by the good performance
and simplicity of the checkerboard features on ACF, we also compute them on
the conv5 feature channels. These are denoted CNNCB features.

The complexity of CNN features is of a different nature than that of ACF
features. First, the implementation on a different processor (GPU instead
of CPU) makes the direct comparison of number of operations meaningless.
Second, while the CNN features are computed on an "as needed" basis, the
structure of the network makes it inefficient to compute each feature
individually. If the CNN features are needed to classify a certain image
window, it is significantly more efficient to compute the 5th layer
responses over the whole window than repeatedly applying the network to
sub-window regions. We account for these difficulties by setting a trigger
complexity $\Omega_{CNN}$ for CNN features. That is, in (25), CNN features
have $\Omega_g = \Omega_{CNN}$ if no CNN feature has been used by the
previous cascade stages to classify the current patch. Once the CNN
features are computed, the complexity of using any CNN feature is 1,
similar to ACF, while CNNCB features have complexity 4.
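The per-feature complexities listed above, together with the trigger cost, can be collected in a small lookup. The sketch below is illustrative only: the function name is hypothetical, the default trigger value of 1000 is a placeholder because the text does not state the value used, and charging the trigger cost to CNNCB features as well (since they are computed on the conv5 channels) is an assumption.

```python
from math import comb

# Per-feature complexities as stated in the text.
BASE_COST = {"ACF": 1, "SS": 2, "CB": 4, "LDA": 9, "CNN": 1, "CNNCB": 4}

def feature_cost(kind, cnn_already_computed, cnn_trigger_cost=1000.0):
    """Complexity charged for using one feature of the given kind at a stage.

    cnn_trigger_cost is a placeholder value, not taken from the paper.
    Assumption: CNNCB features also pay the trigger cost on first CNN use.
    """
    if kind in ("CNN", "CNNCB") and not cnn_already_computed:
        return cnn_trigger_cost
    return BASE_COST[kind]

# Number of SS features per patch: pairs of cells on a 12x6 grid, 10 channels.
num_ss_features = comb(12 * 6, 2) * 10
```

The last line reproduces the count of 25,560 SS features per patch given for the 12x6 grid.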

4.2. Embedding Large CNN Models 

Large CNN models [17, 29] are now popular in computer vision. However, the
use of these models in CompACT is challenging, due to the computational
cost of embedding them in the iterative boosting algorithm. Our attempts to
do so proved impractical. Instead, we limited the use of a large CNN to the
final cascade stage. Upon learning the cascade, we simply used a large CNN
classifier as the final weak learner $g$ of (22). Note that this has no
loss of optimality, since $\alpha$ was learned with (21). The CNN is simply
a descent direction of (19) unavailable to prior stages. It differs from
the standard proposal+CNN approach in that 1) not only the bounding boxes
but also the confidence scores of the cascade are forwarded to the deep CNN
stage, and 2) the combination of the proposal mechanism (cascade) and large
CNN is optimal under the well defined risk of (9).
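The step size used to append a real-valued weak learner, such as a large CNN score, has no closed form and is found by a one-dimensional search. The sketch below uses a golden-section search, which is one possible choice and not necessarily the authors'. It assumes that the complexity term does not vary with the step size, so only the exponential classification risk is searched, and the function names and the search interval are hypothetical.

```python
import numpy as np

def exp_risk(F, y):
    """Empirical exponential risk of predictor values F on labels y."""
    return np.exp(-y * F).mean()

def line_search_alpha(F, g, y, lo=0.0, hi=5.0, iters=60):
    """Golden-section search for the step size minimizing the exponential
    risk of F + alpha * g over the interval [lo, hi] (assumed to bracket it)."""
    ratio = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if exp_risk(F + c * g, y) < exp_risk(F + d * g, y):
            b = d            # the minimum lies in [a, d]
        else:
            a = c            # the minimum lies in [c, b]
    return 0.5 * (a + b)
```

Because the exponential risk is convex in the step size, the search converges to the global minimizer whenever it lies inside the chosen interval; for a binary weak learner it agrees with the usual AdaBoost closed form.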

In our implementation, we considered both the Alex [17] and VGG [29]
models. Previous implementations [12, 15] have warped cropped patches to
size 227x227. However, such large patches are computationally expensive. We
adopted the convolutional layers from the pre-trained models and two
(randomly initialized) fully connected layers of 2048 units each. These
networks were fine-tuned to the pedestrian datasets using Caffe [16]. This
allowed us to use the canonical 128x64 size for the pedestrian template.
For Alex-Net, we used a convolution stride of 2 on the first layer, instead
of 4 in the original model. For VGG-Net, we used all aspects of the
original configuration other than input size and fully connected layers.
While the original VGG-Net is approximately 8 times slower than the
Alex-Net, the modified VGG-Net is only twice as slow.

Figure 2. Stage configuration of the proposed CompACT cascade (blue) and
the manually set cascade (green). Only one in five (fifty) stages is shown
for the CompACT (manual) cascade.

5. Experiments 

Various experiments were performed to evaluate the performance of CompACT
cascades. All times reported are for implementation on a single CPU core
(2.10GHz) of an Intel Xeon E5-2620 server with 64GB of RAM. An NVIDIA Tesla
K40M GPU was used for CNN computations.

5.1. Cascade Configuration 

We started by learning a CompACT cascade on the Caltech pedestrian dataset,
using the set up of [4]. The cascade used 2048 decision trees of depth 2,
and was bootstrapped 6 times during training, after stages
{32, 128, 256, 512, 1024, 1536}, using the procedure of [10, 30]. Figure 2
presents the configuration of the learned cascade, showing how features of
different complexities were chosen at different stages. ACF features, which
are the cheapest, were the only ones selected for the first 200 stages, and
were rarely chosen after stage 500. This suggests that these features are
very efficient but not very discriminant. A better trade-off between these
two goals is achieved by the SS features, which were selected throughout
the training process. It is particularly interesting that these features
are competitive even for the later cascade stages. This suggests that they
can be very discriminant despite their simplicity. Similarly, CB features
were selected across a large range of cascade stages. This is unlike LDA
features, which were rarely selected. These features do not appear to
achieve a good trade-off between discrimination and complexity. More
surprisingly, the CNN features were also rarely selected, with CNNCB
dominating the late cascade stages. This suggests that the CNNCB
representation is more discriminant. Recall that, while the CNN features
are a little more efficient, CompACT boosting weighs complexity less
heavily than discrimination in the late cascade stages.

Table 1. Comparison to single-feature cascades (MR: log-average miss-rate).

            Single Type                                  CompACT
Method      ACF    SS     CB     LDA    CNN    CNNCB     ACF    CNN
MR          42.6   34.29  37.89  37.15  28.07  26.93     32.15  23.82
time (s)    0.07   0.08   0.23   0.16   0.87   2.05      0.11   0.28

Table 2. Comparison to multiple-feature cascades.

            ACF-based                      ACF-based+Small CNN
Method      Boosting  Manual  CompACT      Boosting  Manual  CompACT
MR          33.06     36.08   32.15        22.37     25.46   23.82
time (s)    0.41      0.11    0.11         2.69      0.28    0.28

5.2. Cascade Comparison 

The CompACT cascade of the previous section was compared to cascades of
other architectures. Table 1 presents a comparison to the predominant
architecture in the literature: cascades of a single feature type. In this
case, the complexity penalty of (25) is equal for all weak learners, and
CompACT reduces to standard boosting. This was used to produce "standard"
cascades of ACF, SS, CB, LDA, CNN and CNNCB features. We start by noting
that the implemented ACF outperforms [4]. This is due to the use of a
different bootstrapping strategy. Clearly, SS outperforms the other
ACF-based features (ACF, CB, and LDA), achieving higher accuracy and speed.
This confirms Figure 2, where SS features were selected throughout the
detector. CB and LDA are more discriminant than ACF, but have higher
complexity. CNN features have higher accuracy than all ACF-based features
at the cost of a ten-fold increase in complexity over ACF. Finally, CNNCB
has the best detection results, but only a marginal gain over CNN and much
higher computation. When compared to CompACT cascades, all single feature
cascades perform poorly. CompACT-ACF, which is restricted to ACF-based
features, has higher accuracy than all ACF-based single feature cascades
and is faster than most. CompACT-CNN, which includes all features, has the
best detection performance. Note that not only is its detection performance
clearly superior to that of the best single-feature cascade (CNNCB), but it
is also 10 times faster.

Table 2 presents a comparison to cascades that combine multiple features.
"Boosting" is a cascade learned without complexity constraints ($\eta = 0$
in (25)). This is equivalent to applying existing cascade learning
algorithms to the diverse feature set considered in this work. "Manual" is
an attempt to "hand-code" the behavior of CompACT, by restricting the
boosting algorithm without complexity constraint ($\eta = 0$) to use
certain types of features in different cascade stages. This restriction is
based on feature complexity, as illustrated in Figure 2. The features were
ranked by complexity and used sequentially, each feature type being used in
approximately 400 stages. The two sides of Table 2 differ in that only
ACF-based features were used on the left, while both these and the small
CNN model were used on the right. In both cases, the "manual" cascade has
low complexity but poor accuracy. "Boosting," on the other hand, can
produce a more accurate cascade. The price is, however, a significant
increase in complexity. CompACT achieves the best trade-off between
accuracy and complexity. Note also that the introduction of the small CNN
model enables substantially better cascades, as long as a complexity
penalty is assigned to it during learning.

Table 3. Performance of CompACT cascades using large CNNs.

                      Proposals       Intermediate    Embedded
Method      CompACT   Alex    VGG     Alex    VGG     Alex    VGG
MR          18.92     19.59   14.77   16.18   13.71   14.96   11.75
time (s)    0.25      +0.01   +0.03   +0.01   +0.03   +0.1    +0.25

5.3. Large CNN models 

While the previous experiments only used small models,
a number of experiments were also performed with large models.
These experiments were performed on both Caltech and
KITTI, in both cases using cascades of 4096 decision trees
of depth 5. These were bootstrapped 9 times, after stages
{32, 128, 256, 512, 1024, 1536, 2048, 2560, 3328}. For
Caltech, we used the training set size of [21] and the
template size 64x32 as in [4]. On KITTI, test images were
upsampled by 2 to detect pedestrians of height 25. This
enabled the use of a single template size. After upsampling,
the detected bounding boxes (minimum height of 50) had
twice the actual object size. They were rescaled down by a
factor of 2.
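
The KITTI rescaling step can be illustrated with a minimal sketch. It assumes boxes stored as (x, y, width, height) tuples in the coordinates of the upsampled image; the function name `rescale_boxes` and this box format are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x, y, width, height), in pixels of the upsampled image.
Box = Tuple[float, float, float, float]


def rescale_boxes(boxes: List[Box], upsample_factor: float = 2.0) -> List[Box]:
    """Map boxes detected on an upsampled image back to the original image.

    Every coordinate and size is divided by the upsampling factor, so a
    detection of height 50 on the 2x image becomes a box of height 25.
    """
    if upsample_factor <= 0:
        raise ValueError("upsample_factor must be positive")
    return [
        (x / upsample_factor, y / upsample_factor,
         w / upsample_factor, h / upsample_factor)
        for (x, y, w, h) in boxes
    ]


if __name__ == "__main__":
    # A minimum-height detection (50 pixels) found on the upsampled image.
    detections = [(100.0, 40.0, 25.0, 50.0)]
    print(rescale_boxes(detections))
```

With a factor of 2, the smallest detectable box (height 50 after upsampling) corresponds to a pedestrian of height 25 in the original image, which matches the target stated above.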

Table 3 compares the performance of the CompACT cascade
with small CNNs (denoted CompACT) with several
variants for the inclusion of large CNNs. In all these
variants, the large CNN is computed only on windows selected
by CompACT. The times noted as “+” reflect the added cost
of running the image patches through it. The “Proposals”
columns refer to the use of the CompACT cascade as a
proposal mechanism [12, 15] for the CNN. The “Embedded”
columns refer to the use of the large CNN as the last
stage of the cascade, as discussed in Section 4.2. Finally, the
“Intermediate” columns refer to an intermediate between
these two architectures. As with proposals, the large CNN
stage was only applied to the CompACT output, after
non-maximum suppression (NMS). However, the prediction was
that of (22), i.e. the CNN and CompACT scores were
combined, using the coefficient a learned by boosting.
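
The difference between using CompACT as a proposal mechanism and combining its score with that of the CNN can be sketched as follows. This is only an illustration: it assumes an additive combination of the form cascade score + a × CNN score, which may differ from the exact form of (22), and the names `Detection` and `rescore`, as well as all numerical values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    cascade_score: float  # score of the CompACT cascade for this window
    cnn_score: float      # score of the large CNN for the same window


def rescore(dets: List[Detection], mode: str, a: float = 1.0) -> List[float]:
    """Return the final score of each detection under one of two schemes.

    "proposal": the cascade only selects windows; the CNN score is final.
    "combined": cascade and CNN scores are added, with the CNN weighted
                by a coefficient `a` (assumed additive form, see lead-in).
    """
    if mode == "proposal":
        return [d.cnn_score for d in dets]
    if mode == "combined":
        return [d.cascade_score + a * d.cnn_score for d in dets]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    dets = [Detection(cascade_score=2.0, cnn_score=-1.0),
            Detection(cascade_score=0.5, cnn_score=1.5)]
    print(rescore(dets, "proposal"))         # CNN scores only
    print(rescore(dets, "combined", a=0.5))  # cascade + 0.5 * CNN
```

In this toy example the two schemes rank the windows differently, which is the sense in which the cascade score can carry information not present in the CNN score. Under this sketch, the “Intermediate” and “Embedded” variants would both use the combined mode and differ only in whether the CNN is applied after or before NMS.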

Figure 3. Comparison to state-of-the-art on Caltech (reasonable).

A number of interesting conclusions are possible. First,
under the proposal architecture, only VGG improved on the
CompACT cascade. For Alex, there was no benefit. This
shows that the CompACT cascade is already a very good
classifier. Second, the embedding of the large CNN in
the CompACT model achieved the best results in all cases.
This shows that the CompACT cascade score contains
information that complements that of the CNN scores. For
both CNN models, it was better to combine scores with the
CompACT cascade than to consider the latter simply as a
proposal mechanism. Finally, the theoretically more sound
embedding of the large CNN before NMS (“Embedded”)
always produced higher detection accuracy than the
combination after NMS (“Intermediate”). The latter, however,
required substantially less computation, since the number of
bounding boxes is approximately 10 times smaller after NMS.
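
To illustrate why far fewer boxes remain after NMS, the following is a minimal sketch of greedy non-maximum suppression. It is a generic textbook variant, not necessarily the NMS procedure used in these experiments: the intersection-over-union criterion, the 0.5 threshold, and the (x1, y1, x2, y2) box format are all assumptions.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept, in order of decreasing score.

    A box is kept only if it does not overlap any already kept,
    higher-scoring box by more than `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

Applying the large CNN only to the boxes that survive this step is what makes the “Intermediate” variant cheaper than “Embedded”, at the cost of some accuracy.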

5.4. Comparison with the state-of-the-art 

Figure 3 compares two CompACT pedestrian detectors 
to the state of the art on Caltech. CompACT refers to the 
model using “ACF + small CNN features”, and CompACT- 
Deep to the model with the embedded VGG model in 
the last stage. CompACT achieves state-of-the-art perfor¬ 
mance, close to [37]. Note that the competing detectors 
- Katamari [2] and SpatialPooling+ [23] - combine many 
features (HOG, LBP, spatial covariance, optical flow, mul¬ 
tiple detectors, etc.) and are all quite slow. The same holds 
for the state-of-the-art implementation of Checkerboards, 
which requires a large number of filter channels [37]. On
the other hand, CompACT runs at 4 fps on a relatively slow
processor. The CompACT-Deep cascade performs even better
- 7 points better than the state-of-the-art [37] and 11
points better than the best deep pedestrian detector [15]!
CompACT-Deep runs at 2 fps and is faster than the competing
detectors [2, 23, 37].
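
The frame rates quoted above are consistent with the times in Table 3, assuming these are per-image running times on Caltech and that the “+” entries are added to the 0.25 s of the CompACT cascade. The short sketch below makes the arithmetic explicit; the dictionary layout and function names are hypothetical.

```python
# Per-image times from Table 3, in seconds.
BASE_TIME_S = 0.25  # CompACT cascade with small CNN features

# Added cost of the large CNN for each (variant, model) pair.
ADDED_TIME_S = {
    ("Proposals", "Alex"): 0.01,
    ("Proposals", "VGG"): 0.03,
    ("Intermediate", "Alex"): 0.01,
    ("Intermediate", "VGG"): 0.03,
    ("Embedded", "Alex"): 0.1,
    ("Embedded", "VGG"): 0.25,
}


def total_time(variant: str, model: str) -> float:
    """Total per-image time: base cascade time plus the large-CNN cost."""
    return BASE_TIME_S + ADDED_TIME_S[(variant, model)]


def frames_per_second(seconds_per_image: float) -> float:
    """Convert a per-image time into a frame rate."""
    return 1.0 / seconds_per_image


if __name__ == "__main__":
    # CompACT alone: 0.25 s per image.
    print(frames_per_second(BASE_TIME_S))
    # CompACT-Deep (embedded VGG): 0.25 s + 0.25 s per image.
    print(frames_per_second(total_time("Embedded", "VGG")))
    # Ratio of added VGG cost, embedded vs. intermediate.
    print(ADDED_TIME_S[("Embedded", "VGG")]
          / ADDED_TIME_S[("Intermediate", "VGG")])
```

Under these assumptions, 0.25 s per image corresponds to 4 fps for CompACT and 0.5 s per image to 2 fps for CompACT-Deep. The ratio of added VGG cost between the embedded and intermediate variants is roughly 8, in line with the approximately tenfold reduction in bounding boxes after NMS.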

Figure 4 and Table 4 summarize performance on KITTI.
Since test images are larger than in Caltech, running times
are higher on this dataset. Nevertheless, the CompACT
cascade is the fastest of all the state-of-the-art detectors. Note
that it uses approximately the same number of feature
channels (including the CNN model) as pAUCEnsT [23] and
FilteredICF [37], which are both much less accurate and
slower. R-CNN [15, 12], the only CNN detector on KITTI,
is also substantially weaker than CompACT-Deep
(difference larger than 8 points). Overall, the only approach
competitive with the CompACT-Deep cascade is the Regionlets
method of [33]. However, this work only reports
classification times, excluding the time needed to generate proposals,
which can be on the order of several seconds. This is
equivalent to only accounting for the processing time of the last
stage of the CompACT-Deep model, which is 0.25 seconds.

Figure 4. Comparison to state-of-the-art on KITTI Pedestrian
(moderate).

Table 4. Comparison to state-of-the-art detectors on KITTI. Note:
* ignores the time needed to compute object proposals.

Methods         Easy     Moderate   Hard     Time (s)
DPM             45.50    38.35      34.78    10
DA-DPM          56.36    45.51      41.08    21
RCNN            61.61    50.13      44.79    4
FilteredICF     61.14    53.98      49.29    40
pAUCEnsT        65.26    54.49      48.60    60
regionlet       73.14    61.15      55.21    1*
CompACT         65.35    54.92      49.23    0.75
CompACT-Deep    70.69    58.74      52.71    1

6. Conclusion 

In this work, we proposed the CompACT boosting al¬ 
gorithm for learning complexity-aware detector cascades. 
By optimizing classification risk under a complexity con¬ 
straint, CompACT produces cascades that push features of 
high complexity to the later cascade stages. This has been 
shown to enable the seamless integration of multiple fea¬ 
ture families in a unified design. This integration extends to 
features, such as deep CNNs, that were previously beyond 
the realm of cascaded detectors. The proposed CompACT 
cascades also generalize the popular combination of object 
proposals+CNN, which they were shown to outperform. Fi¬ 
nally, we have shown that a pedestrian detector learned by 
application of CompACT to a diverse feature pool achieves 
state-of-the-art detection rates on Caltech and KITTI, with 
much faster speeds than competing methods. 